[ExecuTorch][WebGPU] Add 4-bit weight-only quantized linear (et_vk.linear_q4gsw) by JulianCloudNTH · Pull Request #20226 · pytorch/executorch

JulianCloudNTH · 2026-06-11T23:41:43Z

Stack from ghstack (oldest at bottom):

[ExecuTorch][WebGPU] GPU timestamp query profiling for SDPA #20167
[ExecuTorch][WebGPU] linear_q4gsw test suite: Llama-1B shapes + 4k/8k sweep #20227
-> [ExecuTorch][WebGPU] Add 4-bit weight-only quantized linear (et_vk.linear_q4gsw) #20226
[ExecuTorch][WebGPU] SDPA test suite: replay + dynamic input_pos + in-graph KV cache #20087
[ExecuTorch][WebGPU] Add fused SDPA (sdpa_with_kv_cache) with dynamic input_pos #20086
[ExecuTorch][WebGPU] GPU timestamp query profiling (general implementation) #20201

Adds the et_vk.linear_q4gsw operator (4-bit groupwise-symmetric weight-only linear) to the WebGPU backend: dequantize the packed int4 weight in WGSL ((q-8)*scale) and accumulate an fp32 matmul, consuming the serialized [N, K/2] uint8 weight directly (no prepack), one workgroup per output row. Mirrors the Vulkan reference (backends/vulkan/.../impl/QuantizedLinear.cpp). The dispatch carries a linear_q4gsw label for GPU-timestamp-query profiling (mirroring the SDPA kernels). The numerical test suite is in the stacked test diff.
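As a rough illustration of the arithmetic described above, here is a minimal pure-Python reference model. It is a sketch, not the WGSL kernel: the function names are hypothetical, the nibble order (even `k` in the low nibble) is an assumption the description does not state, and Python floats are double precision whereas the kernel accumulates in fp32. The scale layout `[num_groups, padded_N]` follows the indexing quoted in the review further down.

```python
def unpack_q4(weight, n, k, k_packed):
    """Return the unsigned 4-bit code for weight element (n, k).

    ASSUMPTION: even k sits in the low nibble, odd k in the high nibble.
    The real serialization order is not stated in the PR description.
    """
    byte = weight[n * k_packed + (k >> 1)]
    return (byte >> 4) & 0xF if (k & 1) else byte & 0xF


def linear_q4gsw_ref(x, weight, scales, M, K, N, group_size, padded_N):
    """Reference out[m, n] = sum_k x[m, k] * (q[n, k] - 8) * scale[k // group_size, n].

    x:      flat list, row-major [M, K]
    weight: flat list of bytes, row-major [N, ceil(K / 2)]
    scales: flat list, row-major [num_groups, padded_N]
    """
    k_packed = (K + 1) // 2
    out = [0.0] * (M * N)
    for m in range(M):
        for n in range(N):
            acc = 0.0
            for k in range(K):
                q = unpack_q4(weight, n, k, k_packed)
                scale = scales[(k // group_size) * padded_N + n]
                # Symmetric dequantization: shift the 0..15 code to -8..7.
                acc += x[m * K + k] * ((q - 8) * scale)
            out[m * N + n] = acc
    return out
```

For example, with `K = 4`, `group_size = 2`, weight bytes `[0x79, 0xA8]` and scales `[0.5, 2.0]`, the dequantized row is `[0.5, -0.5, 0.0, 4.0]` under the assumed nibble order, so an all-ones input yields `4.0`.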

Differential Revision: D108312283

[ghstack-poisoned]

pytorch-bot · 2026-06-11T23:41:47Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20226

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

[ROCm] MI350 CI jobs will have longer queue times due to CI migration

❌ 1 New Failure, 2 Pending, 2 Unrelated Failures

As of commit 8e81436 with merge base 5526971:

NEW FAILURE - The following job has failed:

pull / test-qnn-models-linux (dl3) / linux-job (gh)
RuntimeError: Command docker exec -t d0b7bfb027ab99b275660677293e67b83d6149e426d5958d59094fbec2f1d313 /exec failed with exit code 92

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / unittest / windows / windows-job (gh) (matched win rule in flaky-rules.json)
##[error]The operation was canceled.

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / android / build-android (gh) (trunk failure)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-11T23:42:31Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

JulianCloudNTH · 2026-06-11T23:43:02Z

@claude review

claude · 2026-06-11T23:43:47Z

Claude finished @JulianCloudNTH's task in 4m 47s

Claude Code Review

Read changed files
Review WGSL kernel
Review C++ dispatch & resource management
Cross-check against backend conventions (SDPA, codegen)
Post feedback

Overall this is a clean, well-guarded first cut. Validation-before-allocation, the fp32-only byte-size checks, the K_packed == ceil(K/2) guard, and the resource-release pattern all match the existing SDPA op conventions, and the generated header's wgsl-sha256 matches the .wgsl source (verified locally), so codegen is in sync. Comments are appropriately minimal. A few correctness edge cases are worth tightening before this lands.

Potential bugs / correctness

1. Weight bound as array<u32> can over-read the last word when N * K_packed isn't a multiple of 4 — QuantizedLinear.cpp:198, q4gsw_linear.wgsl:45

The weight is bound as array<u32> with size = weight.nbytes = N * K_packed bytes. The shader computes t_weight[byte_idx >> 2u] with a maximum byte_idx = N*K_packed - 1, so the highest word index is (N*K_packed - 1) >> 2. When N * K_packed is not a multiple of 4, the effective array<u32> length is floor(nbytes / 4), and that final word read falls outside the bound. WebGPU robustness does not fault on such a read; it substitutes a value (typically 0), silently producing wrong values for the affected output column(s). This is shape-dependent (any N*K_packed ≡ 1, 2, or 3 (mod 4)), so it can pass on aligned test shapes and fail in the wild.

Consider validating (N * K_packed) % 4 == 0, or rounding the weight binding size up to a 4-byte multiple, or indexing the weight as array<u32> only after asserting alignment. Worth a dedicated test shape with an odd N*K_packed.
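The index arithmetic behind item 1 can be checked with a small sketch. The helper names are hypothetical, and the code only models the word-index calculation described above; it does not model what a particular WebGPU implementation does with the out-of-range read or with a binding size that is not a multiple of 4.

```python
def weight_word_overread(N, K_packed):
    """True if the highest u32 index the shader computes lies past the bound array."""
    nbytes = N * K_packed
    if nbytes <= 0:
        return False
    words_bound = nbytes // 4        # array<u32> length seen through the binding
    last_word = (nbytes - 1) >> 2    # word index of the last weight byte
    return last_word >= words_bound


def padded_binding_size(nbytes):
    """Round a byte size up to the next multiple of 4 (one of the suggested fixes)."""
    return (nbytes + 3) & ~3
```

With `N = 3` and `K_packed = 5` the weight occupies 15 bytes, the binding exposes 3 whole words, and the last byte lives in word 3, so the check reports an over-read. With `N = 4` the size is 20 bytes and the last byte lives in word 4 of 5, which is in range.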

2. scales bounds (num_groups, padded_N) are not validated against the indices the shader uses — QuantizedLinear.cpp:69-70, q4gsw_linear.wgsl:54

The shader indexes t_scales[(k / group_size) * padded_N + n], with max row (K-1)/group_size and max col N-1. The current byte-size guard only checks scales.nbytes == num_groups * padded_N * 4; it never verifies that num_groups >= ceil(K / group_size) or that padded_N >= N. If either is smaller than expected (mismatched serialization), the kernel reads out-of-bounds scales (robustness-clamped, wrong results) rather than failing loudly. Adding if (num_groups < (K + group_size - 1) / group_size || padded_N < N) throw … would catch malformed inputs at dispatch time. Note that group_size is read at lines 96-102, so this check belongs after that.
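A minimal sketch of the guard suggested in item 2, written in Python for brevity. The function name and error messages are hypothetical; only the two inequalities come from the review text.

```python
def check_scales_shape(K, N, group_size, num_groups, padded_N):
    """Raise if the scales tensor cannot cover every (group, n) the kernel indexes.

    Returns the number of groups the kernel actually needs, ceil(K / group_size).
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    needed_groups = (K + group_size - 1) // group_size
    if num_groups < needed_groups:
        raise ValueError(
            f"scales has {num_groups} groups, kernel needs {needed_groups}")
    if padded_N < N:
        raise ValueError(f"scales padded_N={padded_N} is smaller than N={N}")
    return needed_groups
```

Once both inequalities hold, the largest flat index the shader forms, `((K - 1) // group_size) * padded_N + (N - 1)`, is strictly less than `num_groups * padded_N`, so the existing byte-size check is then enough to keep every scale read in bounds.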

3. M = in_numel / K truncates silently if in_numel % K != 0 — QuantizedLinear.cpp:66

A non-multiple input numel would silently drop the remainder into a smaller M rather than erroring. A quick if (in_numel % K != 0) throw … keeps the failure mode explicit, consistent with the other malformed-dims guards.
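To make the truncation in item 3 concrete, here is a small sketch with a hypothetical helper name: integer division turns 10 elements with `K = 4` into `M = 2` and quietly ignores the last two, whereas an explicit remainder check rejects the input.

```python
def infer_m(in_numel, K):
    """Derive M from the flattened input size, rejecting sizes K does not divide."""
    if K <= 0:
        raise ValueError("K must be positive")
    if in_numel % K != 0:
        raise ValueError(
            f"input numel {in_numel} is not a multiple of K={K}")
    return in_numel // K
```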

Performance (non-blocking)

4. Scale reload + integer division every K iteration — q4gsw_linear.wgsl:54

(k / params.group_size) * padded_N + n recomputes an integer division and reloads the scale on every inner iteration, even though the scale only changes when k crosses a group boundary. Since this is the matmul hot loop, hoisting the scale fetch to update only at group boundaries (or iterating group-by-group) would cut a division + global load per MAC. Fine to defer if this kernel is correctness-first for now, but worth a // TODO.
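The hoisting idea in item 4 can be sketched as two equivalent loops over one weight row. This is an illustrative Python model, not the WGSL source: the function names are hypothetical, and `q_row` holds already-unpacked 4-bit codes so the example does not depend on nibble order.

```python
def dot_per_k(x_row, q_row, scales, n, K, group_size, padded_N):
    """Baseline: one integer division and one scale load per multiply-accumulate."""
    acc = 0.0
    for k in range(K):
        scale = scales[(k // group_size) * padded_N + n]
        acc += x_row[k] * ((q_row[k] - 8) * scale)
    return acc


def dot_per_group(x_row, q_row, scales, n, K, group_size, padded_N):
    """Hoisted: iterate group by group, loading each scale once."""
    acc = 0.0
    for g, k0 in enumerate(range(0, K, group_size)):
        scale = scales[g * padded_N + n]       # one load per group
        k_end = min(k0 + group_size, K)        # last group may be partial
        for k in range(k0, k_end):
            acc += x_row[k] * ((q_row[k] - 8) * scale)
    return acc
```

Both versions perform the same floating-point operations in the same order, so they should return identical results. Factoring the scale out of the inner sum as well would save a multiply per element but changes the rounding order, so that variant would need a tolerance-based comparison.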

Minor / style

QuantizedLinear.cpp:151 declares WGPUBindGroupLayoutEntry entries[6], whereas the SDPA op factored this into a shared build_dispatch helper (ops/sdpa/Sdpa.cpp:150). Not required, but if more quantized-linear variants are coming in the stack, reusing that helper would avoid duplicating the bind-group/pipeline boilerplate.
static_assert(sizeof(Q4gswParams) == 32, …) is a nice guard against accidental struct drift — good.
Error strings consistently prefixed WebGPU linear_q4gsw: — good for triage.

None of the above are blockers except items 1–2, which I'd recommend guarding (or covering with an odd-shape test) before relying on this in production, since both fail silently via robustness clamping rather than throwing.

[ghstack-poisoned]

SS-JIA

Review automatically exported from Phabricator review in Meta.

…near_q4gsw)

Pull Request resolved: #20226

Adds the `et_vk.linear_q4gsw` operator (4-bit groupwise-symmetric weight-only linear) to the WebGPU backend: dequantize the packed int4 weight in WGSL (`(q-8)*scale`) and accumulate an fp32 matmul, consuming the serialized `[N, K/2]` uint8 weight directly (no prepack), one workgroup per output row. Mirrors the Vulkan reference (`backends/vulkan/.../impl/QuantizedLinear.cpp`). The dispatch carries a `linear_q4gsw` label for GPU-timestamp-query profiling (mirroring the SDPA kernels). The numerical test suite is in the stacked test diff.

ghstack-source-id: 392908894
@exported-using-ghexport

Differential Revision: [D108312283](https://our.internmc.facebook.com/intern/diff/D108312283/)

@JulianCloudNTH

…near_q4gsw) (#20262)

This PR was created by the merge bot to help merge the original PR into the main branch.

ghstack PR number: #20226 by @JulianCloudNTH
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/23/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/23/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/20/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/23/orig

Differential Revision: [D108312283](https://our.internmc.facebook.com/intern/diff/D108312283/)
@diff-train-skip-merge

---------

Co-authored-by: Julian Ng-Thow-Hing <juliannth@meta.com>

Update

bab43f6

[ghstack-poisoned]

JulianCloudNTH requested review from kirklandsign and larryliu0820 as code owners June 11, 2026 23:41

meta-cla Bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Jun 11, 2026

Update

8e81436

[ghstack-poisoned]

meta-codesync Bot added the meta-exported label Jun 12, 2026

SS-JIA approved these changes Jun 12, 2026

meta-codesync Bot merged commit ec31143 into gh/JulianCloudNTH/23/base Jun 13, 2026
178 of 188 checks passed

meta-codesync Bot deleted the gh/JulianCloudNTH/23/head branch June 13, 2026 00:00

meta-codesync Bot temporarily deployed to cherry-pick-bot June 13, 2026 00:00 Inactive

pytorchbot mentioned this pull request Jun 13, 2026

[ExecuTorch][WebGPU] Add 4-bit weight-only quantized linear (et_vk.linear_q4gsw) #20262

Merged

meta-codesync[bot] merged 2 commits into gh/JulianCloudNTH/23/base from gh/JulianCloudNTH/23/head

3 participants
